Claude/web crawler python3 modernize n21pe by jmg · Pull Request #21 · jmg/crawley

jmg · 2026-06-05T19:00:01Z

No description provided.

…itemaps

Closes the main "framework" gaps vs Scrapy, built on the existing async engine (httpx, retries, rate limiting, robots.txt, de-duplication).

- crawley.spider: `Request` (callback, meta, cb_kwargs, headers, priority, dont_filter, errback, fingerprint/replace), `Item`, and a callback-driven `Spider` (parse/start_requests/on_item, depth tracking, fingerprint de-dup).
- response.follow()/response.meta and response.request for list->detail crawls.
- crawley.pipelines: `ItemPipeline` + `DropItem`; spiders run items through the pipeline chain (open_spider/close_spider/process_item, sync or async).
- crawley.spiders: `LinkExtractor` (allow/deny/restrict_xpaths/restrict_css), `Rule`, `CrawlSpider` (rule-based following) and `SitemapSpider` (sitemap.xml + sitemap index).
- RequestManager.make_request accepts per-request headers.
- Extractors parse from bytes so XML-with-declaration (sitemaps) is handled.

Tests (169 -> 180): test_spider, test_spiders; conftest serves /sitemap.xml.
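For readers unfamiliar with the "sync or async" pipeline chain mentioned above, here is a minimal, self-contained sketch of the idea. It does not import crawley and is not its implementation: `run_pipelines`, `StripWhitespace` and `RequirePrice` are hypothetical names, and the local `DropItem` only mirrors the name used in the commit message. The assumption is that each stage's `process_item` may return either an item or an awaitable, and that raising `DropItem` discards the item.

```python
import asyncio
import inspect


class DropItem(Exception):
    """Raised by a pipeline stage to discard an item (local stand-in)."""


class StripWhitespace:
    # Synchronous stage: trims whitespace from string fields.
    def process_item(self, item, spider):
        return {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}


class RequirePrice:
    # Asynchronous stage: drops items that have no price.
    async def process_item(self, item, spider):
        if not item.get("price"):
            raise DropItem("missing price")
        return item


async def run_pipelines(item, pipelines, spider=None):
    """Pass item through each stage in order; return None if a stage drops it."""
    for stage in pipelines:
        try:
            result = stage.process_item(item, spider)
            # A stage may be sync or async; only await when needed.
            if inspect.isawaitable(result):
                result = await result
        except DropItem:
            return None
        item = result
    return item


if __name__ == "__main__":
    stages = [StripWhitespace(), RequirePrice()]
    print(asyncio.run(run_pipelines({"name": " Widget ", "price": "9.99"}, stages)))
    print(asyncio.run(run_pipelines({"name": "No price"}, stages)))
```

Checking `inspect.isawaitable` on the return value, rather than inspecting the method itself, lets a plain function return a coroutine or future and still work.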

…ders

- crawley.http.playwright.PlaywrightRequestManager: render pages with a headless browser (lazy import), with per-host throttling and retries; wired into the engine via `render_js = True` and `playwright_options`. Optional extra `crawley[js]`.
- Docs: new "Spiders" page (Request/callbacks/follow, item pipelines, CrawlSpider/LinkExtractor, SitemapSpider, JS rendering); API reference and nav updated.
- examples/06_spider.py (callback spider + pipeline), indexed and test-covered.
- README "Spiders" section; CHANGELOG updated.

Tests (180 -> 187): test_playwright (render path mocked, no browser needed) plus the spider example.
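The "lazy import" for the optional browser dependency can be illustrated with a generic helper. This is a sketch of the general pattern only, not the code in this PR: `lazy_import` is a hypothetical name, and the `pip install` hint assumes the package is installed from an index under the name `crawley` with the `js` extra named above.

```python
import importlib


def lazy_import(module_name, extra):
    """Import module_name on first use, or explain which extra provides it.

    Hypothetical helper: deferring the import means users who never enable
    JS rendering do not need the optional dependency installed.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install it with: pip install 'crawley[{extra}]'"
        ) from exc


if __name__ == "__main__":
    # A stdlib module stands in for the optional dependency here.
    json = lazy_import("json", "js")
    print(json.dumps({"render_js": True}))
```

Calling the helper inside the method that first needs the browser, rather than at module import time, keeps `import crawley` cheap and free of the optional dependency.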

- crawley.stats.StatsCollector: per-crawl counters (requests, responses, status/<code>, request_errors, robots_blocked, items/items_dropped, elapsed), exposed as crawler/spider `stats` and logged on finish.
- crawley.http.cache.HttpCache: on-disk response cache keyed by method+url+body. Enable with `http_cache = True` / `http_cache_dir`; wired into RequestManager, FastRequestManager and the Playwright manager.
- crawley.spider.FormRequest + FormRequest.from_response(): read a <form>, pre-fill inputs/selects/textareas, honour its method (GET -> query string), override fields via formdata.

Docs (crawler stats/cache, spiders forms, API reference) and CHANGELOG updated.

Tests (187 -> 201): test_stats, test_cache, test_forms; conftest serves /login-form.
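To make "keyed by method+url+body" concrete, here is one way such a key and a tiny on-disk store could look. This is a sketch under stated assumptions, not `HttpCache` itself: `cache_key` and `DiskCache` are hypothetical names, SHA-256 and the one-file-per-key layout are choices made for illustration, and the real cache's hashing and file format are not shown in this PR.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(method, url, body=b""):
    """Hash method, URL and body into a filename-safe hex digest."""
    h = hashlib.sha256()
    for part in (method.upper().encode(), url.encode(), body):
        # Length-prefix each part so ("ab", "c") and ("a", "bc") differ.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


class DiskCache:
    """Hypothetical minimal store: one file per key, raw response bytes."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def get(self, method, url, body=b""):
        path = self.directory / cache_key(method, url, body)
        return path.read_bytes() if path.exists() else None

    def put(self, method, url, content, body=b""):
        (self.directory / cache_key(method, url, body)).write_bytes(content)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = DiskCache(tmp)
        cache.put("GET", "https://example.com/", b"<html></html>")
        print(cache.get("get", "https://example.com/"))   # hit: method is case-insensitive
        print(cache.get("POST", "https://example.com/", b"q=1"))  # miss: None
```

Including the body in the key matters for POST requests, where the same URL can return different responses for different payloads.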

- crawley.middlewares.DownloaderMiddleware: process_request / process_response / process_exception chains (sync or async) wrapping every Spider download. process_request may short-circuit with a Response or reschedule a Request; process_exception can recover from errors.
- crawley.http.autothrottle.AutoThrottle: adapt the per-host delay to the observed response latency (target_concurrency, start/max delay). Enable with `autothrottle = True`; Response now carries `.latency` (httpx elapsed / measured render time), fed to the per-host rate limiter.

Docs (spiders middlewares, politeness AutoThrottle, API reference) and CHANGELOG updated.

Tests (201 -> 213): test_middlewares, test_autothrottle.
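As a rough illustration of adapting a per-host delay to observed latency, the sketch below follows the rule that Scrapy documents for its AutoThrottle extension: aim for `latency / target_concurrency`, move halfway from the current delay towards that target, never go below the target, and clamp to the configured bounds. Whether crawley's `AutoThrottle` uses this exact formula is an assumption; `next_delay` and `min_delay` are hypothetical names, and only `target_concurrency` and the start/max delay appear in the commit message.

```python
def next_delay(current_delay, latency, target_concurrency=1.0,
               min_delay=0.0, max_delay=60.0):
    """Return the delay (seconds) to use before the next request to a host."""
    # Delay that would keep roughly target_concurrency requests in flight.
    target = latency / target_concurrency
    # Smooth the adjustment: average the current delay with the target.
    new_delay = (current_delay + target) / 2.0
    # React to slowdowns immediately: never wait less than the target.
    new_delay = max(target, new_delay)
    # Keep the result inside the configured bounds.
    return min(max(new_delay, min_delay), max_delay)


if __name__ == "__main__":
    print(next_delay(1.0, 0.5))                          # 0.75
    print(next_delay(1.0, 4.0, target_concurrency=2.0))  # 2.0
    print(next_delay(5.0, 200.0))                        # 60.0
```

The asymmetry is deliberate: a fast response lowers the delay gradually, while a slow one raises it to the target in a single step.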

claude added 4 commits June 5, 2026 18:23

jmg merged commit fa0bf52 into master Jun 5, 2026
7 of 9 checks passed

jmg merged 4 commits into master from claude/web-crawler-python3-modernize-N21pe

2 participants

Conversation

jmg commented Jun 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants